Measurement in Writing
This article reviews research examining technical features of curriculum-based measurement (CBM)
in written expression. Twenty-eight technical reports and published articles are included in this review. 
Studies examining the development and technical adequacy of measures of written expression are sum- 
marized, beginning with research conducted at the Institute for Research on Learning Disabilities at 
the University of Minnesota and followed by extensions of this work. Differences in technical features 
of writing tasks, sample durations, and scoring procedures employed within and across elementary and 
secondary levels are highlighted. Gaps in research addressing the technical adequacy of CBM in writ- 
ten expression are identified, and implications for future research and practice are discussed. 


Progress monitoring has of late been on the agenda of ed- 
ucational policy decision makers and administrators. With 
standards-based reform and school accountability at the forefront
of educational policy (e.g., No Child Left Behind Act of
2001), it has become clear that if all students are to meet rig- 
orous academic standards, assessment tools are needed to track 
student progress toward those standards and to quickly and 
accurately identify students at risk for failing to reach them. 
Moreover, some have suggested the use of progress monitor- 
ing as part of a nondiscriminatory, response-to-intervention 
approach for special education referral and identification (see 
Fuchs & Fuchs, 2006; Speece, Case, & Molloy, 2003). For 
students receiving special education services, progress moni- 
toring is viewed as a way to uphold major tenets of the Indi- 
viduals with Disabilities Education Improvement Act (IDEIA, 
2004) by aligning goals and objectives on Individualized Ed- 
ucation Programs with performance and progress in the gen- 
eral curriculum (Nolet & McLaughlin, 2000). 

Recently, educators have focused increasing attention 
on monitoring students’ performance and progress in writing. 
This increased attention is, in part, in response to reports of 
high proportions of students who do not meet proficiency lev- 
els in writing. For example, results of the National Assessment 
of Educational Progress 2002 writing assessments indicated 
that 72% of 4th graders, 69% of 8th graders, and 77% of 12th 
graders were performing below a proficient level (National 
Center for Education Statistics, 2003). The emphasis on writ- 
ing performance is also reflected in states’ attempts both to in- 
troduce or revise standards that represent the multifaceted, 
complex nature of the writing process and to implement as- 
sessment procedures that sufficiently measure critical elements 
of this construct (Nolet & McLaughlin, 1997). 

Technically sound measures of writing progress are 
needed to ensure that students are progressing toward writing 
standards, to identify those who struggle, and to inform in- 
struction aimed at improving students’ writing proficiency. One 
of the most extensively researched progress monitoring ap- 
proaches is curriculum-based measurement (CBM; Deno, 
1985). CBM is a procedure in which multiple probes of equiv- 
alent difficulty are administered repeatedly, yielding time- 
series data that reflect student progress. CBM is simple and 
efficient: Brief samples of behavior, such as the number of 
words read correctly in 1 min, correlate strongly with critical 
academic outcomes, such as reading comprehension. Teachers 
can use such data to quickly and accurately establish baseline 
performance, set individual goals, graph student progress, and 
modify instruction when progress is insufficient (Deno, 1985). 
A 30-year program of research has demonstrated CBM’s ca- 
pacity to provide reliable and valid indicators of student per- 
formance and progress in basic skill areas such as reading and 
mathematics (see Foegen, Jiban, & Deno, 2006; Marston, 1989; 
Wayman, Wallace, Wiley, Ticha, & Espin, 2006). 

Purpose 

The purpose of this article is to review the literature on the 
technical adequacy of CBM in written expression. In doing 
so, we pay particular attention to reliability and validity, which
are technical qualities required of any measurement tool to be 
used for educational decision making. 

Reliability 

Reliability refers to the precision, accuracy, and consistency 
of a measurement procedure (Thorndike, 2005). With respect 
to the development of progress measures such as CBM, reli- 
ability is important for two specific reasons. First, because 
such measures are used to discriminate among groups of stu- 
dents (e.g., to identify students at risk), it is important to know 
that an individual will maintain his or her standing relative to 
others across testing occasions, alternate forms, and scorers. 
Second, because CBM is used to make individual decisions 
based on progress over time, it is important to know the amount 
of individual variation that can be expected across repeated 
measurements. 

There is not a consensus on criteria by which to judge 
the reliability of measures. Thus, in reporting study findings, 
we can, at best, discuss reliability in relative terms. For ex- 
ample, we can compare coefficients to those found for other 
types of CBM, as well as to other types of writing measures. 
In reading — the most well established domain in CBM — 
reliability coefficients have generally been reported as r > .85 
(Wayman et al., 2006). For standardized writing measures, alternate-form
and test-retest reliability estimates have ranged
from .70 to above .90 (Taylor, 2003). With this information in 
mind, we consider reliability coefficients of r > .80 to be rel- 
atively strong, r = .70 to .80 to be moderately strong, r = .60 
to .70 to be moderate, and r < .60 to be weak. 
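
The relative bands above can be written as a small lookup. This is only an illustrative sketch: the function name is hypothetical, and because the stated ranges share their endpoints (.70 and .80), the code assumes that a value of exactly .70 falls in the higher of its two bands and that exactly .80 counts as moderately strong.

```python
def classify_reliability(r: float) -> str:
    """Label a reliability coefficient using the review's relative bands.

    Assumption (not stated in the text): boundary values .60 and .70 are
    assigned to the higher band; .80 itself is 'moderately strong' because
    the text reserves 'relatively strong' for r > .80.
    """
    if r > 0.80:
        return "relatively strong"
    if r >= 0.70:
        return "moderately strong"
    if r >= 0.60:
        return "moderate"
    return "weak"


# Example: label a few coefficients of the kind reported in this review.
for coefficient in (0.91, 0.72, 0.64, 0.55):
    print(coefficient, classify_reliability(coefficient))
```

The same labels are applied informally throughout the Results and Discussion; the sketch simply makes the cut points explicit.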

Validity 

Reliability is a necessary, but not sufficient, feature of mea- 
sures to be used for educational decision making. The valid- 
ity of a measure — how well it measures what it purports to 
measure — is critical (Thorndike, 2005). The complexity of the 
writing process poses a particular challenge for establishing 
validity. Writing involves several major activities, including 
generating and organizing ideas, translating those ideas into 
written form, and revising the written product (Hayes & Flower, 
1980). These activities require the coordination of a variety 
of processes, including lexical knowledge and retrieval, pho- 
nological and semantic coding, use of syntactic structures (e.g., 
Berninger, 1994), self-monitoring (McCutchen, 1996), and 
ortho-motor skills (Jones & Christensen, 1999). Researchers 
must demonstrate that a brief measure designed for repeated 
administration can serve as a valid indicator of students’ over- 
all writing proficiency, which presumably encompasses all of 
the above processes. 

Criterion validity — how well a measure relates to other 
measures in the same domain — is often the focus of technical 
adequacy studies. But criterion validity is only one aspect of 
validity. In judging the adequacy of a measure to be used for 
educational decision making, the overall construct validity of 
a measure should be considered. Messick (1995) provided a 
useful framework for judging construct validity, stating that 
construct validity should be viewed as a unified concept com- 
prising (a) content validity (representativeness of the domain 
being sampled), (b) substantive validity (reflecting the theo- 
retical rationale underlying the measure), (c) structural va- 
lidity (how well the scoring structure fits with the construct 
being measured), (d) external (convergent and discriminant) 
validity, (e) generalizability (how well scores and interpreta- 
tions generalize across populations, settings, and tasks), and 
(f) consequential validity (implications for educational deci- 
sion making). 

To demonstrate the first four aspects of validity, writing 
tasks should (a) represent the multifaceted nature of the 
writing process (content validity); (b) reflect the variety of 
cognitive processes that writing theorists have indicated are 
important (substantive validity); (c) be scored using proce- 
dures that are not too narrow or too broad such that relevant 
information is overlooked or irrelevant information is included 
(structural validity); and (d) correlate well with comprehen- 
sive writing measures that assess multiple writing domains 
and not correlate well with measures of other constructs, such 
as mathematical problem solving (external validity). 

To demonstrate the generalizability and consequential as- 
pects of validity, CBM writing measures should be seamless 
(useful across a variety of students in general and special ed- 
ucation and students of different ages) and flexible (useful for 
monitoring progress across a variety of curricula found in 
states, districts, and schools). A seamless and flexible prog- 
ress monitoring system allows systematic comparison of 
growth rates under a variety of instructional conditions and 
allows the progress of students to be followed from one year 
to the next, from one setting to the next, and from one curricu- 
lum to the next. Because we believe seamlessness and flexi- 
bility to be important goals in the development of progress 
monitoring tools, we pay particular attention to how writing 
measures function for students of different ages and skill lev- 
els, how well writing measures reflect growth, and whether 
these functions vary by different types of measures. 

Method 

The search for studies of CBM in written expression was part 
of a literature search by the Research Institute on Progress 
Monitoring at the University of Minnesota. Electronic data- 
bases including ERIC, Science Citation Index Expanded, 
PsycINFO, and Expanded Academic Index were searched using
the following terms: curriculum-based measurement (or mea- 
sure), general outcome measure, and progress monitoring. 
This yielded 578 articles and reports. Titles and abstracts were 
screened to confirm that they related to CBM, and Methods 
sections were screened to identify empirical studies, yielding
160 articles. Articles were grouped by subject (reading, math, 
spelling, and writing); 9 of these addressed writing. In addi- 
tion, 17 technical reports on writing were accessed from the 
Institute for Research on Learning Disabilities (IRLD) at the 
University of Minnesota. An ancestral search of the studies identified
in the initial search yielded 12 additional articles. Studies
were included if they reported information regarding relia- 
bility and/or any of the six aforementioned aspects of valid- 
ity (n = 28). 

Results and Discussion 

In this section, we begin by summarizing studies conducted 
by the IRLD, as these studies provided the foundation for later 
work. Then, we summarize extensions of this work conducted 
within and across elementary and secondary levels. Studies 
reporting validity and reliability correlations are summarized 
in Table 1 by section (IRLD studies, elementary studies, sec- 
ondary studies, and studies across grade levels) and then in 
chronological order. Note that only ranges of coefficients are 
reported in Table 1, along with all criterion measures. Because
of space limitations, we describe in more detail below those 
correlations that are most useful for understanding the tech- 
nical adequacy of the CBM measures. 

IRLD Studies 

Criterion Validity. The first IRLD studies of written 
expression focused on the criterion validity of a number of 
different tasks and scoring procedures (Deno, Mirkin, & Marston,
1980; summary published as Deno, Marston, & Mirkin,
1982). Writing tasks included story prompts, topic sentences, 
and picture stimuli to which students responded for 1 to 5 min. 
Responses to each type of task were scored for the number or 
length of T-units (one main clause plus any attached subordinate
clauses; Hunt, 1965), large words (words with seven or more
letters), mature words (words not commonly used, as mea- 
sured by the Standard Frequency Index; Finn, as cited in Deno 
et al., 1980), number of words written (WW), words spelled 
correctly (WSC), and correct letter sequences (CLS; any two 
adjacent letters that are correct according to the spelling of the 
word). 
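
As an illustration of how some of these countable indices might be tallied, the following minimal sketch counts WW, WSC, and large words for one sample. It is not the IRLD scoring procedure: the function name, the tokenizing rule, and the `known_words` spelling list are all hypothetical stand-ins for a human scorer's judgment, and T-units, mature words, and CLS are omitted because they require information (clause boundaries, a frequency index, intended spellings) that a simple word list does not supply.

```python
import re


def score_sample(text: str, known_words: set[str]) -> dict[str, int]:
    """Tally three countable indices for one writing sample.

    known_words is a hypothetical list of correctly spelled words; in
    practice a trained scorer decides whether each word is spelled correctly.
    """
    # Assumption: a "word" is any run of letters or apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    ww = len(words)                                          # words written
    wsc = sum(1 for w in words if w.lower() in known_words)  # words spelled correctly
    large = sum(1 for w in words if len(w.replace("'", "")) >= 7)  # seven or more letters
    return {"WW": ww, "WSC": wsc, "large_words": large}


# Hypothetical sample and spelling list, for illustration only.
sample = "The dog runed quickly to the playground"
spelling_list = {"the", "dog", "ran", "quickly", "to", "playground"}
print(score_sample(sample, spelling_list))
```

In this made-up sample, "runed" is written but not spelled correctly, so WSC is one less than WW.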

Across studies, validity coefficients were strongest for 3- 
to 5-min samples of writing. Validity coefficients were strongest 
between the Test of Written Language (TOWL; Hammill & 
Larsen, 1978) raw total score and mature words (rs = .76-.88), 
WW (rs = .69-.82), and WSC (rs = .71-.88; Deno et al., 1980,
Studies 1 and 2) and between the Developmental Scoring System
(DSS; Lee & Canter, 1971) and WW (rs = .84-.88), WSC
(rs = .76-.84), and CLS (rs = .78-.86). Correlations were similar
for each type of prompt (story, picture, topic sentence).
From this work, it appeared that the number of letters and 
words produced in 3- to 5-min samples provided valid indices 
of writing performance — at least as measured by the TOWL, 
which assesses multiple dimensions of writing using an ana- 
lytic rubric, and the DSS, a measure of syntactic maturity that 
also uses an analytic rubric. 

Videen, Deno, and Marston (1982) extended this work 
by introducing correct word sequences (CWS; any two adja- 
cent, correctly spelled words that are acceptable within the 
context of the sample). Videen et al. wondered whether stu- 
dents might begin generating words that would not add mean- 
ing to their writing but would improve their writing scores if 
only WW and WSC were used to monitor progress. Thus, they 
suggested that CWS might better reflect improvement but still 
maintain ease and efficiency of scoring. Samples from Deno 
et al. (1980) were selected randomly and scored for CWS. 
Weak to moderate correlations for CWS were found with the 
DSS (r = .49) and TOWL (r = .69). Correlations between CWS 
and holistic ratings of the samples were relatively strong (r = 
.85). Correlations were weak (rs = -.03 to .20) between CWS, 
mean T-units, and Poteet’s checklist (cited by Videen et al.),
on which samples were rated according to penmanship, spell- 
ing, grammar, and ideation. 
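
The CWS definition given above can be sketched as a count over adjacent word pairs. This is a simplified illustration, not the scoring rules used by Videen et al.: it assumes a scorer has already marked each word as acceptable (correctly spelled and acceptable in context) or not, and it ignores any additional conventions a full scoring manual might apply. The function name and the flags are hypothetical.

```python
def count_cws(acceptable: list[bool]) -> int:
    """Count adjacent word pairs in which both words were judged acceptable.

    acceptable[i] is a scorer's judgment for the i-th word of the sample
    (an assumed input; the judgment itself is not automated here).
    """
    return sum(1 for first, second in zip(acceptable, acceptable[1:])
               if first and second)


# Hypothetical five-word sample in which the third word was judged incorrect.
flags = [True, True, False, True, True]
print(count_cws(flags))
```

A single unacceptable word breaks the two sequences it participates in, which is why CWS penalizes errors more heavily than WW or WSC do.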

Reliability. IRLD researchers examined several types 
of reliability of written expression measures, including test- 
retest and alternate-form reliability, and internal consistency. 
Most studies also reported interscorer reliability, which was 
generally strong, with coefficients above .90 for most mea- 
sures (Deno et al., 1982; Marston & Deno, 1981, Study 4; 
Marston et al., 1983; Marston, Lowry, Deno, & Mirkin, 1981; 
Tindal, Marston, & Deno, 1983; Videen et al., 1982). 

In terms of test-retest reliability, Marston and Deno (1981, 
Study 1) found that WW, WSC, and CLS written in 5 min had relatively
strong test-retest correlations over a 1-day interval (rs =
.91 for WW, .81 for WSC, and .92 for CLS) and moderate 
correlations over a 3-week interval (rs = .64 for WW, .62 for 
WSC, and .70 for CLS). Deno, Marston, et al. (1982) exam- 
ined what they termed “growth stability” (reliability from fall 
to spring). Coefficients for first-graders were weak (rs = 
.20-.47). Coefficients for WSC and CLS were moderate to
strong for second- through sixth-graders (rs = .60-.86), ex- 
cept for WSC in Grade 3 (r = .37). Tindal, Germann, and 
Deno (1983) reported fall to spring coefficients of r = .56 for 
fifth-graders for both WW and CLS. 

With respect to alternate-form reliability, Marston and
Deno (1981, Study 2) found that reliability between two 5- 
min story prompts was strong for WW (r = .95), WSC (r = 
.95), and CLS (r = .96). Tindal, Marston, and Deno (1983) 
obtained reliability coefficients ranging from r = .72 for WSC 
to .93 for CLS. Shinn, Ysseldyke, Deno, and Tindal (1982) 
obtained weaker coefficients (rs = .51-.71 for WW), as did
Tindal, Germann, and Deno (1983), who reported reliabilities 
for fourth- and fifth-graders of r = .71 for WW and .70 for 
number of letters. 

TABLE 1. Characteristics of Studies Examining Technical Adequacy of Curriculum-Based Measurement in Written Expression
(Minnesota Department of Children, Families, and Learning and NCS Pearson, 2002); GPA = grade point average; EN8, EN9, EN10 = English GPA for 8th, 9th, & 
10th grades; SS8, SS9, SS10 = social studies GPA for 8th, 9th, & 10th grades; THASS = Tindal & Hasbrouck Analytic Scoring System (Tindal & Hasbrouck, 1991); WKCE = Wisconsin Knowledge and Concepts Examination (CTB/McGraw-Hill, 1996). 

76 THE JOURNAL OF SPECIAL EDUCATION VOL. 41/NO. 2/2007 

Fuchs, Deno, and Marston (1982) aggregated scores
across alternate forms in an attempt to reduce error associated
with a particular stimulus or with a particular point in time
(e.g., a student might have more to say about one topic than
another or might be more tired, hungry, anxious, etc., during
one session than another). Participants responded weekly to
different story prompts for 10 weeks. Correlations were calculated
between scores on adjacent measures (WSC in Week
1 and Week 2), across 4 sessions (mean WSC from Weeks 1
and 3 with mean WSC from Weeks 2 and 4), then across 6, 8,
and 10 sessions. Aggregations across more days resulted in
stronger reliability (r = .55 across 2 days, .72 across 4 days,
.85 across 6 days, .88 across 8 days, and .89 across 10 days).
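
The aggregation procedure described above can be sketched as follows. The sketch assumes that, for a given number of sessions, each student's odd-numbered weeks are averaged and correlated with the average of the even-numbered weeks, which matches the 2- and 4-session examples in the text; how the original authors handled 6, 8, and 10 sessions is assumed to follow the same pattern. Function names and the data are hypothetical.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def aggregated_reliability(weekly_scores: list[list[float]], n_sessions: int) -> float:
    """Correlate each student's mean over odd weeks with the mean over even weeks.

    weekly_scores[s][w] is student s's score (e.g., WSC) in week w + 1.
    Only the first n_sessions weeks are used; n_sessions must be even.
    """
    if n_sessions < 2 or n_sessions % 2 != 0:
        raise ValueError("n_sessions must be an even number of at least 2")
    half = n_sessions // 2
    odd_means = [sum(s[0:n_sessions:2]) / half for s in weekly_scores]   # Weeks 1, 3, ...
    even_means = [sum(s[1:n_sessions:2]) / half for s in weekly_scores]  # Weeks 2, 4, ...
    return pearson(odd_means, even_means)


# Hypothetical WSC scores for four students over four weeks (illustration only).
scores = [[12, 15, 11, 14], [20, 18, 23, 21], [30, 27, 31, 33], [8, 12, 9, 10]]
for k in (2, 4):
    print(k, round(aggregated_reliability(scores, k), 2))
```

Averaging over more sessions reduces the influence of any single prompt or testing occasion on each student's aggregate, which is the mechanism by which the reported coefficients rose as sessions were added.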

Marston and Deno (1981, Study 3) examined internal 
consistency of mature words, WW, WSC, and CLS written in 
five 1-min intervals. Split-half reliability coefficients ranged 
from rs = .96 to .99, and Cronbach’s alpha ranged from rs = 
.70 to .87. Thus, it appears that students’ writing performance 
remained stable within a sample. 
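
For readers unfamiliar with internal consistency estimates, Cronbach's alpha for a sample scored in several 1-min intervals can be computed with the standard formula, alpha = k/(k - 1) x (1 - sum of interval variances / variance of total scores). The sketch below is a generic implementation of that formula, not a reconstruction of Marston and Deno's analysis; the function name and data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(rows: list[list[float]]) -> float:
    """Cronbach's alpha for a students-by-intervals table of scores.

    rows[s][i] is student s's score (e.g., WW) in 1-min interval i.
    Requires at least two students and two intervals, and total scores
    that are not all identical (otherwise the total variance is zero).
    """
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two students and two intervals")
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical WW counts for four students across five 1-min intervals.
ww_by_interval = [
    [6, 7, 5, 6, 7],
    [10, 11, 9, 12, 10],
    [3, 4, 4, 3, 5],
    [8, 8, 9, 7, 9],
]
print(round(cronbach_alpha(ww_by_interval), 2))
```

Alpha approaches 1 when students keep the same rank order from one interval to the next, which is the sense in which performance "remained stable within a sample."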

Sensitivity to Growth. In addition to criterion validity 
and reliability, IRLD researchers examined sensitivity of writ- 
ing measures to growth, an important indicator of the valid- 
ity of CBM measures if they are to be used for monitoring 
progress. One approach to examining sensitivity to growth is 
to compare scores for students of different ages and skill lev- 
els. Deno et al. (1980) found that the number of mature words, 
WW, WSC, and CLS successfully differentiated among 
students at different grades, as well as between students with 
and without learning disabilities (LD). Similarly, Shinn et al. 
(1982) found that students with low achievement reliably out- 
performed students with LD on WSC, but that students with 
LD reliably outperformed students with low achievement in 
growth from fall to spring. Shinn et al. cautioned that these 
results were complicated by moderate test-retest reliability 
(rs = .51-.71).

Marston et al. (1981) and Deno et al. (1982) found that 
WW, WSC, and CLS increased from first to sixth grade and 
from fall to spring within each grade, although not dramati- 
cally, especially at fifth and sixth grades. Similarly, Marston, 
Deno, and Tindal (1983) found significant growth in WW, 
WSC, and CLS across the first through sixth grades, as well 
as within-grade gains across 10 weeks. Growth was not evi- 
dent on the Stanford Achievement Test (SAT; Madden, Gard- 
ner, Rudman, Karlsen, & Merwin, 1978) Language subtest 
over the same 10-week period. Marston et al. (1983) argued 
that direct measures of written expression were more appro- 
priate than were standardized tests for monitoring progress 
over short intervals. 

Summary and Discussion of IRLD Studies. The 
IRLD studies laid the groundwork for developing technically 
sound written expression measures in the following ways. 
First, across third through sixth grades, moderate to strong cri- 
terion validity coefficients were found for countable indices 
of writing. Coefficients were strongest between mature words, 
WW, WSC, and both the TOWL and DSS (rs = .67-.88).


Moreover, coefficients did not differ substantially among 
different writing tasks (story prompts vs. topic sentences vs. 
picture stimuli) or for 3- to 5-min samples. Such findings in- 
dicated that valid measures of written expression could be ob- 
tained with brief writing samples and relatively efficient, 
objective scoring procedures. 

IRLD researchers also examined reliability of measures 
across first through sixth grades. Reliability was not as strong 
within grade levels as it was across grade levels (e.g., Tindal, 
Germann, & Deno, 1983). Reliability coefficients were also 
lower for students with LD and for students with low achieve- 
ment, possibly reflecting a range restriction (Marston & Deno, 
1981; Shinn et al., 1982). Poor reliability is problematic, es- 
pecially if the measures are used to identify struggling writ- 
ers. There is some evidence that aggregating scores across 
sessions improves reliability (Fuchs et al., 1982), but aggre- 
gation is also problematic if measures are to be given on a fre- 
quent basis. If six measures are needed to obtain reliable 
information, it might take weeks or even months to determine 
whether a student is making progress, limiting timely in- 
structional decisions. 

IRLD researchers demonstrated that several scoring pro- 
cedures are sensitive to growth (Deno et al., 1980; Deno et al., 
1982; Marston et al., 1981; Shinn et al., 1982). However, most 
examinations of growth were conducted cross-sectionally 
(across grades) or from fall to spring; no one examined the 
technical adequacy of measures for monitoring progress on a 
frequent (e.g., weekly) basis. This remains an important area 
for future development if measures of written expression are 
to be used for progress monitoring. 

Extensions of Research on CBM in 
Written Expression: Elementary Studies 

Several researchers have further examined the technical ade- 
quacy of written expression measures for elementary students 
by examining (a) measures for students at different skill lev- 
els, (b) measures to be used for screening, (c) new scoring pro- 
cedures, and (d) measures for beginning writers. 

Students at Different Skill Levels. Tindal and Parker 
(1991) noted that a better understanding of how writing mea- 
sures function for students at different skill levels was needed. 
They examined criterion validity and sensitivity to growth of 
writing assessments for elementary students of a range of skill 
levels. We found it interesting that in contrast to IRLD stud- 
ies, correlations among WW, WSC, CWS, and analytic scores 
applied to the same writing sample (1 to 5 on story idea, or- 
ganization-cohesion, and conventions-mechanics) were weak to 
moderate (rs = -.02 to .63). 

Statistically significant differences were found between 
students with LD and general education students on all 
measures, between Chapter 1 (at-risk) and general education 
students on most measures, and between students with low performance
and general education students on some measures,
indicating that the measures effectively differentiated among 
students of different skill levels. Students also made signifi- 
cant gains in WW, WSC, and CWS from fall to spring. How- 
ever, the authors noted that some students’ writing improved 
in quantity but not quality, while others’ improved in quality 
but not quantity. They concluded that evaluating writing for 
educational decision making would likely require a multifac- 
eted approach. Tindal and Hasbrouck (1991) further exam- 
ined writing samples of students of different ages and skills 
and suggested that whereas quantitative scoring procedures 
appeared more useful for monitoring progress, a qualitative 
scoring approach provided further, complementary informa- 
tion that could be used for making a diagnosis and for tailor- 
ing instruction to address specific needs. 

Screening. To identify suitable screening measures for 
identifying struggling writers, Parker, Tindal, and Hasbrouck 
(1991b, Study 1) scored responses to a story prompt admin- 
istered in fall and spring quantitatively (WW, WSC, CWS, 
%WSC, and %CWS) and qualitatively (using a 7-point hol- 
istic rating of communicative effectiveness). Correlations 
between quantitative and qualitative scores were weak to mod- 
erate at each grade (see Table 1). 

To assess further the utility of quantitative scores for 
screening, Parker et al. (1991b) examined dispersions in the 
bottom 30% to 40% of the score distributions and standard
error of measurement (SEm) bands on percentile line graphs.
These analyses indicated that %WSC was suitable for second- 
graders; %WSC and %CWS were suitable for third-graders; 
and %WSC, %CWS, and WW were suitable for fourth- 
graders. The remaining indices were not deemed suitable for 
screening due to their failure to distinguish among low per- 
formers. The authors concluded that across second through 
fifth grades, %WSC was the most viable screening tool, given 
its moderate correlation with holistic ratings and suitable dis- 
tribution in the lower ranges. They emphasized that the lack 
of sensitivity to student differences of the other scoring ap- 
proaches could lead to false negatives and recommended cau- 
tion in their use. 

New Scoring Procedures. Gansle, Noell, VanDerHey- 
den, Naquin, and Slider (2002) cited teachers’ dissatisfaction 
with using WW as a primary index of writing, a practice that 
was occurring in schools in which they conducted their re- 
search. To address this issue, they compared WW to a variety 
of new scoring procedures, including number of nouns, verbs, 
and adjectives; long words; WSC; total and correct punc- 
tuation; capitalization; complete sentences; CWS; sentence 
fragments; simple sentences; and computer-scored variables.
Interscorer reliability ranged from r = .70 (sentence fragments)
to .96 (WW). Alternate-form reliability was weak to moder- 
ate, ranging from r = .006 (long words) to .62 (WW). Simi- 
lar to Tindal and Parker’s (1991) findings, criterion validity 
coefficients were weak, with none above .40. Correct punctuation
and CWS accounted for 34% of the variance with
teacher rankings, 33% to 45% of the variance with the Iowa 
Test of Basic Skills (ITBS; Hoover, Hieronymus, Frisbie, & 
Dunbar, 1996) language subscales, and 16% to 32% of the 
variance with writing subtests on the Louisiana Educational 
Assessment Program (LEAP; Mitzel & Borden, 2000). 

Gansle et al. (2004) then used six “promising” variables — 
as determined by Gansle et al. (2002): WW, total and correct 
punctuation, words in complete sentences, CWS, and total 
simple sentences — to index students’ writing improvement 
following a brief intervention. Participants responded to one 
of two counterbalanced story prompts, received 25 min of in- 
struction on the writing process, and then responded to the 
second prompt. Interscorer agreement was above .90 for all 
scoring indices except simple sentences (r = .78). Validity co- 
efficients with the Woodcock Johnson-Revised (WJ-R; Wood- 
cock & Johnson, 1989) Written Samples subtest were weak 
(r = -.05 to r = .42). Only WW improved following instruction.
In general, findings of these studies lend little support to
the technical adequacy of either new or existing scoring pro- 
cedures. 

Measures for Beginning Writers. Lembke, Deno, and 
Hall (2003) examined the technical adequacy of measures for 
beginning writing. Lembke et al. examined some new types 
of writing tasks, including word and sentence copying and 
dictation tasks. Scores on these tasks were correlated with two 
types of criterion variables: “atomistic” (discrete, countable in- 
dices, including average WW, WSC, CWS, and correct minus 
incorrect word sequences [CIWS] obtained from a writing 
sample) and “holistic” (teachers’ global ratings of the same 
writing samples). 

Correlations between word copying scores and the atomistic
variables were weak to moderate (rs = .10-.69). WSC
and CLS obtained from word dictation appeared to be more
strongly related to the atomistic variables (rs = .82-.92 for average
WW and WSC; rs = .52-.92 for average CWS and
CIWS). Moderate correlations were found between WSC and
CWS on sentence copying and average WW and WSC (rs =
.74-.79). Scores on sentence dictation had a wide range of
correlations with atomistic variables (rs = .39-.92). Weaker correlations
were associated with average CIWS, and stronger
correlations were associated with average WW and WSC.
Most correlations with holistic ratings were weak to moderate
(rs = .06-.67) with the exception of WSC on word dictation
(r = .83) and CIWS on sentence dictation (r = .84).

Summary and Discussion of Elementary Studies. 

Several researchers have extended IRLD research by exam- 
ining technical features of both existing and new scoring 
procedures. Whereas most researchers reported interscorer re- 
liability, few examined test-retest or alternate-form reliabil- 
ity. The exception was Gansle et al. (2002), who reported 
relatively weak alternate-form reliability coefficients for a va- 
riety of new scoring procedures. The lack of reliability data is 
problematic, as reliability is a necessary precondition for validity
(Thorndike, 2005).

With respect to criterion validity, the results of these stud- 
ies were substantially less positive than were the results of the 
IRLD studies. There are several possible reasons for this. First, 
weak criterion validity might be a function of weak reliabil- 
ity; as mentioned, this information was unavailable for most 
studies. Weaker validity coefficients may also reflect differ- 
ences in study samples. The IRLD studies often included multi- 
grade samples and often reported correlations across grades, 
whereas more recent studies reported correlations within grades. 
The use of multigrade samples may have served to increase the 
range of scores and thus increase reliability and validity co- 
efficients (indeed, when IRLD studies included within-grade 
analyses, correlations were weaker). Continued work is needed 
to develop measures of written expression that have technical 
adequacy within, as well as across, elementary grades. 

Another difference between earlier and later studies is 
the measures used as criterion variables. The strongest valid- 
ity coefficients obtained in the IRLD studies were with the 
TOWL and DSS, which included analytic scoring of a vari- 
ety of writing domains. In later studies, less direct measures of 
written expression were used as criterion measures, such as 
the language subtests of the ITBS (Gansle et al., 2002). Likewise,
holistic ratings used in different studies varied in
the range of possible scores, criteria for different ratings, and 
who completed the ratings (e.g., teachers vs. researchers). A 
further complication with holistic ratings is that in some stud- 
ies (Parker et al., 1991b; Tindal & Parker, 1991), they were ap- 
plied to the same writing samples that were scored using CBM 
procedures. Holistic ratings may not be a valid approach for 
evaluating the quality of such brief samples. Applying differ- 
ent scoring procedures to the same writing samples could also 
inflate correlations, as compared to using completely separate 
criterion measures. Of course, these reasons for discrepant 
findings between recent research and IRLD studies are largely 
speculative; what is clear is that continued research is needed 
to determine the best ways to index elementary students’ writ- 
ing performance. 

In addition to raising questions about the reliability and 
criterion validity of CBM writing measures, researchers ex- 
tended the IRLD work in other ways. Tindal and Parker (1991) 
explored differences between quantitative and qualitative 
scoring procedures and suggested that using both approaches 
provides the most useful data for making educational deci- 
sions, thus addressing the structural aspect of validity (Mes- 
sick, 1995). Tindal and Parker and Parker et al. (1991b) also 
found that while measures reliably distinguished among stu- 
dents at different grades or who were served in different ed- 
ucational programs, they were less effective in identifying 
students at risk for writing difficulties, which presents limita- 
tions for screening. Gansle et al. (2004) demonstrated that 
WW improved following a brief writing intervention. These 
studies provide some insight into generalizability (i.e., how well 
measures work across different populations) and consequen-
tial validity (i.e., utility of measures for educational decision
making; Messick, 1995). 

Secondary Studies 

Researchers have also extended the study of CBM to address 
writing for middle and high school students. These researchers 
have examined (a) measures integrating reading and writing 
skills, (b) measures for students requiring remedial and spe- 
cial education, (c) measures for screening and monitoring 
progress, (d) more complex scoring procedures, (e) different 
writing tasks and durations, and (f) validity of measures for pre- 
dicting performance on school-based indicators. 

Measures Integrating Reading and Writing Skills. 
Tindal and Parker (1989b) proposed that measures requiring in- 
tegration of basic skills with recall of content area information 
might provide more functional information for secondary- 
level teachers than do traditional CBM measures that treat 
“basic skills in isolation” (p. 329). Thus, they examined how 
well a written retell would relate to other reading and writing 
tasks. Here, we focus on findings related to the writing tasks. 

Middle and high school students’ creative writing sam-
ples were rated holistically based on communicative effective-
ness (1 = very poor; 5 = very effective). Written retells of a
grade-level reading passage were scored for WW, commu- 
nicative effectiveness, and the number of passage-related idea 
units. On the written retell task, most students retold only 10% 
of the main ideas from the reading passages, and scores 
dropped steadily from one grade to the next, rather than im- 
proving as might be expected. Further, scores on the written 
retells and the writing samples were not significantly corre- 
lated. The authors concluded that written retell skills are dif- 
ferent from creative writing skills and strongly encouraged 
further research exploring other approaches to monitoring 
secondary students’ progress. 

Measures for Students Requiring Remedial and 
Special Education. Tindal and Parker (1989a) found reliable 
differences between students requiring remedial and special 
education on teachers’ holistic ratings, %WSC, %CWS, and 
mean length of correct word sequences (ML/CWS). Watkin- 
son and Lee (1992) produced similar results: Middle school 
students with and without LD differed significantly on CWS, 
incorrect word sequences, %WSC, and %CWS. 

Tindal and Parker (1989a) analyzed intercorrelations 
among scoring procedures and defined two clusters, which 
they identified as “production dependent” (fluency measures 
that relied on length; WW, WSC, and CWS), and “production 
independent” (percentage measures that relied on accuracy; 
%WSC, %CWS, and percent legible words [%LW]). Percent- 
age measures had moderately strong correlation coefficients 
with holistic ratings (rs = .73-.75 for %CWS and %LW), 
whereas fluency variables produced weaker coefficients (rs = 
.10-.59). Tindal and Parker concluded that percentage mea-
sures were more predictive of holistic ratings of the writing
of students with low performance than were fluency mea- 
sures, but cautioned that percentage measures do not have 
equal interval scales and are thus difficult to interpret when 
trying to distinguish among students at different skill levels. 
Moreover, they are problematic for monitoring progress (e.g., 
if a student produced 10 WSC out of 20 WW in fall, and 50 
WSC out of 100 WW in spring, %WSC would not reflect any 
growth, possibly masking important progress). 
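
The arithmetic behind this caution can be written out as a short sketch. The function name percent_wsc is hypothetical; the counts are the ones used in the example above.

```python
def percent_wsc(wsc: int, ww: int) -> float:
    """Percentage of words spelled correctly (WSC) out of words written (WW)."""
    if ww <= 0:
        raise ValueError("ww must be positive")
    return 100.0 * wsc / ww

# Counts taken from the example in the text.
fall = percent_wsc(wsc=10, ww=20)
spring = percent_wsc(wsc=50, ww=100)

# Change on the percentage index versus change on the fluency index (raw WSC).
percent_gain = spring - fall
fluency_gain = 50 - 10
```

Here the percentage index is 50% at both time points, so percent_gain is zero, while the raw WSC count rises by 40 words.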

Measures for Screening and Monitoring Progress. 
Parker et al. (1991b, Study 2) further examined fluency and 
percentage indices to identify suitable screening measures for 
6th- through 11th-graders. Criterion validity with a 7-point
holistic rating scale (applied to the same writing sample) was 
weak to moderate within each grade for fluency measures (rs = 
.39-.56) and for percentage measures (rs = .34-.46). Stu-
dents’ scores increased from fall to spring on all measures.
CWS discriminated better among 8th- and 11th-graders and
students at middle- and high-score ranges than it did among
students at lower grades and score ranges. Percentage CWS 
appeared most sensitive for discriminating among low scor- 
ers, but lacked precision. The authors cautioned that lack of 
precision is problematic as it increases the likelihood of iden- 
tifying false negatives. 

Parker, Tindal, and Hasbrouck (1991a) examined writ- 
ing samples obtained from struggling middle school writers 
across 6 months. Interscorer reliabilities for all scoring pro- 
cedures were strong (rs = .83-.98). Split-half reliabilities, cal- 
culated by correlating scores from the first and last 3 min of 
each sample, were moderate to strong (rs = .69-.81). Only
WW increased regularly over the 6 months for students in 
each grade. There were no reliable differences across grades 
for any of the writing indices, so data were aggregated to ex- 
amine criterion validity. Correlations between holistic ratings 
and WSC, %WSC, %LW, CWS, and ML/CWS were weak to 
moderate (rs = .43-.76). The remaining indices were not suf-
ficient predictors of holistic ratings. Similar patterns were
found between the writing indices and the TOWL, although 
coefficients were generally weaker (rs = .15-.56).
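
A split-half reliability of the kind described above can be sketched as follows. The WW counts are invented for illustration, and the helper pearson_r is a hypothetical name; the sketch simply correlates each student's score on the first half of a sample with the score on the second half, and it applies no further correction.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Invented WW counts (assumption, not study data) for six students:
# words written in the first 3 min and in the last 3 min of one sample.
first_half = [12, 20, 15, 30, 22, 18]
second_half = [10, 24, 13, 27, 25, 15]

# Split-half reliability: correlation between the two halves.
split_half_r = pearson_r(first_half, second_half)
```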

Parker et al. (1991a) also conducted both “static” and 
“dynamic” comparisons of different scoring procedures. Sta- 
tic comparisons involved correlating scores obtained at two 
time points to determine growth stability. The most consis- 
tently strong correlations were obtained for WW, LW, and 
WSC, although these varied (rs = .68-.83). Dynamic com-
parisons involved creating profiles for each measure by plot-
ting standardized mean scores with confidence bands. The
authors again expressed skepticism regarding the measures; 
whereas some appeared promising in terms of validity, sta- 
bility, or sensitivity to growth, none was adequate in all of 
these areas. CWS, ML/CWS, and %LW appeared promising 
for screening, but not for monitoring progress. Parker et al. 
suggested that further research was needed to examine whether 
greater standardization of writing topics, greater structuring
of writing tasks, or combining scores from more than one
writing sample would provide more stable writing indices. 

More Complex Scoring Procedures. Espin, Scierka, 
Skare, and Halverson (1999) explored the utility of combining 
measures and using computerized scoring. They examined the 
writing of four skill groups: students with LD, students in 
basic skills English classes, students in regular English classes, 
and students in enriched English classes. Writing samples 
were typed into a word-processing program, and the grammar- 
check function was used to obtain WW, WSC, characters writ- 
ten, characters per word, and sentences written. CWS and 
ML/CWS were counted manually. Reliability of the measures 
was not examined in this study. Validity coefficients were 
weak to moderate for each scoring procedure (see Table 1). 

The four skill groups differed significantly on characters 
per word, sentences, and ML/CWS, with students who had 
LD performing the lowest, followed by students in the basic, 
regular, and enriched English classes. A multiple regression 
revealed that a combination of characters per word, sentences, 
and ML/CWS provided a better index of writing than did any 
single variable. Espin et al. (1999) concluded that using a 
combination of scores might be necessary to assess secondary 
students’ writing proficiency. Like Parker et al. (1991a), Espin 
et al. suggested that using more than one sample might yield 
stronger criterion validity. At the same time, they emphasized 
the need for continued research to identify the most valid and 
reliable indicators of secondary students’ writing that would 
preserve the simplicity and efficiency of CBM. 

Type of Writing Task and Sample Duration. Espin, 
Shin, Deno, Skare, Robinson, and Benner (2000) addressed 
two issues that had not yet been explored in secondary-level 
writing research. First, because much of secondary-level writ- 
ing is expository, they wondered whether expository samples 
(rather than narrative samples) would better reflect students’ 
writing proficiency. Second, they examined whether duration 
mattered. Participants produced two narrative and two expos- 
itory samples in 5 min each, marking their places at 3 min. 
Alternate-form reliability coefficients for incorrect words, 
ML/CWS, and characters per word were consistently weak 
(rs < .60). For the remaining measures, coefficients were mod- 
erate to strong, especially for WW, WSC, CWS, CIWS, and 
number of characters (all rs > .72). A multiple regression in- 
dicated that CIWS was the only reliable predictor of holistic 
ratings. There were no substantial differences in technical 
adequacy between type (narrative vs. expository) or duration 
(3 min vs. 5 min) of samples. 

Espin, De La Paz, Scierka, and Roelofs (2005) further 
extended research at the secondary level by examining the re- 
lation of CWS and CIWS to the number of functional ele- 
ments and quality ratings of text. They also examined whether 
the length of text affected reliability and validity and whether 
CWS and CIWS would be sensitive to growth. The researchers 
randomly selected pre- and posttest essays that students had 
written for a writing intervention (De La Paz, 1999). In the
original study, students had 35 min to write their essays; thus, 
scores were not based on brief writing samples, as they had 
been in previous research. Correlations between WW, CWS, 
and CIWS and criterion measures were moderate to strong 
(rs = .58-.90). Espin et al. also calculated validity coefficients
for the first 50 words of each writing sample, which resulted 
in a notable decrease in the size of correlations (rs = .33-.59). 
It should be noted that because the CBM scoring procedures 
and the criterion measures were applied to the same writing 
samples, these correlations may be somewhat inflated. 

Espin et al. (2005) reported that WW, CWS, and CIWS 
written in 35 min increased reliably over time. When only the 
first 50 words of each sample were examined, increases in 
CWS and CIWS from pre- to posttest were observed for lower 
performers; however, longer samples were needed to detect 
growth in higher performers. Espin et al. concluded that CWS 
and CIWS obtained from expository essays were promising 
indicators of secondary students’ writing proficiency. Further 
support for the measures was provided in that they detected 
growth when a systematic writing intervention was in place 
(De La Paz, 1999). However, strong correlations and sensitiv- 
ity to growth were associated with samples written in 35 min, 
which is much longer than a typical CBM writing sample. 
Espin et al. again emphasized the need for research to iden- 
tify sufficient durations while maintaining ease and efficiency 
of measurement. 

Predicting Performance on School-Based Indicators. 

Fewster and MacMillan (2002) investigated whether middle 
school students’ written expression performance was predic- 
tive of high school performance. District writing CBM data 
were collected from participants’ sixth- or seventh-grade rec- 
ords. Participants were then divided into four groups based on 
their high school educational placement: special education, 
remedial, general education, or honors classes. English and 
social studies grade point averages (GPAs) were collected from 
students’ 8th-, 9th-, or 10th-grade records. Correlations be-
tween middle school writing and high school GPAs were rel-
atively weak (rs = .16-.34). This finding might have been due
to differences in educational programming across grades, dif- 
ferences in grading standards among teachers, or differences 
in length of time between CBM administration and GPAs 
awarded for different students. A discriminant analysis indi- 
cated that CBM scores reliably distinguished among students 
in special education, remedial classes, general education, and 
honors classes, suggesting that the measures had utility for 
screening. 

Espin, Wallace, Campbell, Lembke, Long, and Ticha 
(2006) examined the use of writing measures for gauging in- 
dividual performance and progress toward state standards. Al- 
ternate-form reliability coefficients for WW, WSC, CWS, and 
CIWS produced in 3, 5, 7, and 10 min were moderate to strong 
(rs = .64-.85) and appeared to strengthen with duration, with
coefficients above r = .70 for 5-min samples and above r =
.80 for 7- and 10-min samples. Criterion validity with a holis-
tically scored state writing test was weak to moderate (rs =
.23-.60). Overall, CIWS obtained from 7-min samples ap-
peared to have the strongest reliability and validity, although
coefficients for 5-min samples were also deemed acceptable. 
The researchers used the 7-min samples to demonstrate how 
to construct “Tables of Probable Success,” which used logis- 
tic regression to estimate the chances of passing the state stan- 
dards test. 
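
The general idea of a table built from a logistic regression can be sketched as follows. This is not the procedure or the coefficients reported by Espin et al. (2006); the intercept, slope, and score range are hypothetical values chosen only to show how a fitted logistic model converts a CBM score into an estimated probability of passing.

```python
from math import exp

def probability_of_passing(ciws: float, intercept: float, slope: float) -> float:
    """Logistic function: estimated probability of passing given a CIWS score."""
    return 1.0 / (1.0 + exp(-(intercept + slope * ciws)))

# Hypothetical coefficients (assumption); in practice these would be
# estimated from students' CBM scores and their pass/fail outcomes.
INTERCEPT = -3.0
SLOPE = 0.05

# Table mapping selected CIWS scores to an estimated probability of passing.
table = {
    score: round(probability_of_passing(score, INTERCEPT, SLOPE), 2)
    for score in range(0, 121, 20)
}
```

Under these assumed coefficients, a CIWS score of 60 corresponds to an estimated probability of .50, and the probabilities rise with each higher score in the table.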

Summary and Discussion of Secondary Studies. In 
terms of reliability of measures of secondary-level writing, in- 
ternal consistency (Parker et al., 1991a) and alternate-form 
reliability (Espin et al., 2000, 2005) of various scoring pro- 
cedures generally appeared sufficient, with most rs > .70. 
With respect to criterion validity, however, the following 
seems clear: Simple, countable indices such as WW and WSC 
are not sufficient for assessing secondary students’ writing. 
Tindal and Parker (1989a) and Parker et al. (1991a) found that 
for middle school remedial and special education students, 
percentage measures were better predictors of holistic ratings 
of writing (rs = .73 to .75) than were fluency measures (rs = 
.10 to .59). Moreover, percentage measures more reliably dis- 
tinguished between students with and without LD than did 
fluency measures (Tindal & Parker, 1989a; Watkinson & Lee, 
1992). However, while several measures appeared promising 
for screening, fine-grained analyses indicated that none was 
sufficiently sensitive to differences among low scorers, rais- 
ing the concern of identifying false negatives (Parker et al., 
1991b), and none was sufficient for monitoring progress (Parker 
et al., 1991a). 

Espin and colleagues demonstrated that more complex 
measures, such as a combination of measures (Espin et al., 
1999) or CIWS (Espin et al., 2000) might be needed. The lat- 
ter is a promising finding because unlike percentage indices, 
CIWS is more viable for detecting growth, and unlike com- 
binations of variables, it is likely to be less time consuming to 
score and interpret. Espin et al. (2000) also demonstrated that 
validity of CIWS does not appear to depend on type of writ- 
ing prompt (narrative vs. expository) or sample duration (3 min 
vs. 5 min) for middle school students. However, Espin, De La 
Paz, et al. (2005) found that longer samples (35 min) yielded 
stronger validity than did 50-word samples. Further, Espin et 
al. (2006) demonstrated that for 10th-graders, validity of
CIWS strengthened when the duration was increased to 7 min. 

Espin et al. (2006) developed a procedure using Tables 
of Probable Success that could be used to predict students’ 
probability of passing state standards tests based on CBM per- 
formance. Such a tool might be useful for identifying students 
at risk for failing state tests, setting reasonable goals, provid- 
ing instructional accommodations, and monitoring student 
progress toward higher probabilities of passing. Research is 
needed to examine the validity of data generated by Tables of 
Probable Success and to determine whether schools’ use of 
such a tool would lead to improved student outcomes. Such 
research, along with research examining the utility of mea-
sures for monitoring writing progress on a frequent basis to
make instructional decisions, would shed further light on the 
consequential validity of secondary writing measures. 

Studies Across Grade Levels 

Results of studies reviewed thus far suggest that different 
scoring indices might be needed at different grade levels, rais- 
ing questions about the “seamlessness” of writing measures 
across students of different ages. Secondary-level studies re- 
vealed that more complex scoring procedures had stronger 
technical characteristics than did simple scoring procedures. 
Elementary-level research is less conclusive. It is unclear 
whether simple indices such as WW and WSC are sufficient 
(as suggested by IRLD studies) or not sufficient (as suggested 
by more recent studies). To understand better whether differ- 
ent measures are needed at different levels, several researchers 
have compared the technical adequacy of measures across 
grade levels. 

Malecki and Jewell (2003) observed that many districts 
were collecting normative writing data only on simple fluency 
measures (WW, WSC, CWS) and noted that whereas there is 
some evidence of the utility of these measures for elementary 
students (primarily from IRLD work), researchers have shown 
stronger technical adequacy of percentage measures (%WSC 
or %CWS; Tindal & Parker, 1989a) or CIWS (e.g., Espin et
al., 2000) for older students. Malecki and Jewell investigated 
which indices were most appropriate at different grade levels. 
They also examined potential gender differences, which might 
be important for developing district norms. 

Malecki and Jewell (2003) found reliable differences be- 
tween grade levels on fall fluency and percentage measures, 
including CIWS, with sixth- through eighth-graders scoring 
higher than third- through fifth-graders did, who scored higher 
than first- through second-graders did. Girls reliably outper- 
formed boys, and this gap widened over time on CWS and 
CIWS. For %WSC, girls outperformed boys in first through 
second grade, but the gap closed at later grades. At all grades, 
WW, WSC, CWS, and CIWS improved significantly from fall 
to spring, with no grade by time interactions. For the percent- 
age measures, only first- and second-graders’ scores showed 
reliable improvement. 

Jewell and Malecki (2005) examined criterion validity 
of the fluency and percentage indices described above. Stu- 
dents at higher grades reliably outperformed students at lower 
grades with the exception of fourth- and sixth-graders on per- 
centage measures. For second- and fourth-graders, weak to 
moderate correlations were found between most CBM scores 
and the SAT language subtests (rs = .34-.67) and between
percentage scores, CIWS, and the SAT spelling subtest (rs =
.43-.56). For sixth-graders, positive but relatively weak cor-
relations were found between SAT subtests, percentage mea-
sures, and CIWS (rs = .41-.52). Language arts grades were
weakly to moderately correlated with all scoring indices for
fourth-graders (rs = .45-.61), but only with %WSC and CIWS
for sixth-graders (.45 and .36, respectively). Finally, the ana-
lytic scoring system was weakly correlated with all scoring
indices for second- and fourth-graders (rs = .34-.58) and with
CWS, %WSC, %CWS, and CIWS for sixth-graders (rs =
.33-.52).

Based on these results, Jewell and Malecki (2005) con- 
cluded that simple measures, such as WW and WSC, become 
less valid as grade level increases. They suggested using per- 
centage measures or CIWS with middle school students. They 
also noted that percentage measures and CIWS were more 
strongly related to criterion measures at all grades, a finding 
consistent with previous research (Tindal & Parker, 1989a). 
They further cautioned that none of the validity coefficients 
was overwhelmingly strong, which was also consistent with 
other findings at elementary and secondary levels (e.g., Espin 
and colleagues; Gansle et al., 2002; Tindal & Parker). Jewell 
and Malecki’s overall conclusion was that it is critical to con-
sider students’ gender and grade, as well as the purpose of as-
sessment, when deciding which measures to use.

Weissenburger and Espin (2005) examined reliability and 
validity of narrative writing prompts across 4th, 8th, and 10th 
grades. Alternate-form reliability was moderate to strong at 
each grade for WW (rs = .55-.84), CWS (rs = .59-.84), and
CIWS (rs = .61-.82). Correlations were stronger for longer writ-
ing samples and weaker at higher grades. Criterion validity of
the measures with the Language Arts Normal Curve Equivalent
(NCE) scores from a statewide test was weak at each grade
level for WW (rs = .04-.45) and slightly stronger at Grades
4 and 8 for CWS (rs = .47-.62) and CIWS (rs = .60-.68). Va-
lidity of CWS and CIWS was weak at Grade 10 (rs = .18-.36).
Similarly, coefficients with holistic writing scores from the
statewide test were weak for WW (rs = .33-.48) and moder-
ate for CWS and CIWS (rs = .50-.65) for 4th- and 8th-graders
(data were not available for 10th-graders). Although increased
duration led to increased alternate-form reliability, it gener- 
ally did not strengthen the validity of the measures. 

Findings from the above studies suggest that criterion 
validity of CBM in written expression decreases as students 
get older. However, in these studies, only narrative samples were 
used. Because older students are often required to produce 
more expository than narrative writing (e.g., Deshler, Ellis, & 
Lenz, 1996), the validity of expository prompts warrants fur- 
ther investigation. Also, although the validity of measures ad- 
ministered in the Weissenburger and Espin (2005) study did 
not increase substantially with time, previous research has in- 
dicated that longer samples do increase validity of writing 
scores (e.g., Espin et al., 2005; Espin et al., 2006). The effect 
of duration on the criterion validity of writing samples should 
also be further explored. 

Implications for Future Research 

An extensive amount of work has been done to identify tech- 
nically sound approaches to assessing written expression within 
a CBM framework, that is, using measures that are simple and
efficient to obtain reliable and valid indicators of overall writ- 
ing proficiency. This work has provided a foundation for fur- 
ther research on monitoring writing progress. It is clear that 
much work is needed to develop seamless and flexible writ- 
ing measures to be used within a system of accountability, 
whereby students at risk for failing to meet standards are iden- 
tified, intervention effectiveness is evaluated, and student prog- 
ress within and across grade levels is monitored. 

Reliability 

Most of the research on CBM in written expression has re- 
ported reliability of static (one point in time) measures or sta- 
bility across a wide timeframe (e.g., fall to spring). However, 
an important goal is to develop measures that can be used to 
monitor student progress on a frequent basis to facilitate in- 
structional decisions. This is an area wide open for further in- 
vestigation. For example, possible variation in scores obtained 
from different writing prompts has particular implications for 
future research addressing the use of measures for monitor- 
ing writing progress. There might be considerable variability 
in the interest and background knowledge that students bring 
to different writing prompts, which could affect the quality 
and quantity of their responses and could, in turn, lead to sub- 
stantial bounce from one measurement point to the next. None 
of the studies in this review addressed reliability of slopes, yet 
this is a critical component of research needed in the devel- 
opment of progress-monitoring tools. 
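
For readers unfamiliar with the term, a growth slope is the rate of change in a student's scores across repeated measurements. The sketch below estimates one with ordinary least squares; the weekly CWS scores are invented, the function name growth_slope is hypothetical, and least squares is only one of several ways a slope might be estimated. Examining the reliability of slopes would then involve asking how stable such estimates are, for example across alternate prompts.

```python
def growth_slope(weeks, scores):
    """Ordinary least-squares slope: average change in score per week."""
    n = len(weeks)
    if n != len(scores) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_w, mean_s = sum(weeks) / n, sum(scores) / n
    sxy = sum((w - mean_w) * (s - mean_s) for w, s in zip(weeks, scores))
    sxx = sum((w - mean_w) ** 2 for w in weeks)
    return sxy / sxx

# Invented weekly CWS scores (assumption, not study data) for one student.
weeks = [1, 2, 3, 4, 5, 6]
cws_scores = [20, 23, 21, 26, 27, 30]

# Estimated weekly gain in CWS; note the week-to-week "bounce" around the trend.
weekly_gain = growth_slope(weeks, cws_scores)
```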

Validity 

Procedures developed thus far have yielded more modest cri- 
terion validity coefficients than have those obtained in other 
areas of CBM research. In fact, recent research at both ele- 
mentary and secondary levels has yielded only moderate 
criterion validity at best. This may be a result of criterion mea- 
sures that do not directly assess written expression, such as 
language subtests of standardized measures, poorly constructed 
criterion measures (other writing measures that have ques- 
tionable technical adequacy themselves), or varying criteria 
for holistic ratings that tend to have only moderate interscorer 
reliability. 

Modest validity coefficients might also be a reflection of 
the difficulty associated with measuring the complex, multi- 
faceted construct of writing proficiency. Although validity co- 
efficients for CBM writing measures have generally been 
lower than those seen for CBM measures in other academic 
areas, coefficients in many of the studies in this review are sim- 
ilar to or better than those seen for other commonly used mea- 
sures of written expression. For example, the criterion-related 
validity reported for the TOWL-3 (Hammill & Larsen, 1996) 
ranges from .34 to .68 for the various subtests (with only the
spelling subtest above .60). Given the general difficulties as-
sociated with measuring writing proficiency, in evaluating the
technical adequacy of CBM it is important to search for con-
sistent findings across multiple criterion measures that tap
various aspects of this construct (Messick, 1995). Further, it is 
important to examine divergent validity, that is, whether CBM 
writing measures correlate more strongly with other writing 
measures than they do with measures in other domains, such 
as reading or math. 

Two critical aspects of validity, discussed earlier in this 
review, are the generalizability and consequential validity of 
measures, which are necessary if writing measures are to be 
used within a seamless and flexible system of progress mon- 
itoring. In terms of generalizability, much work in written ex- 
pression has focused on students in general education or those 
with high-incidence disabilities; whether the materials and 
procedures are appropriate for other populations of students, 
such as English learners and those with significant disabili- 
ties who access the general education curriculum, is not yet 
well understood. With respect to consequential validity, to fa- 
cilitate educational decision making within and across grades, 
research is needed to determine which measures are most ap- 
propriate at which grade levels and to establish methods to 
connect student progress both within and across grades. Fi- 
nally, if measures are used by teachers to monitor progress 
and make instructional decisions, it is necessary to demon- 
strate that student performance improves as a result. 

Implications for Practice 

Clearly, much work remains to identify the most useful mea- 
sures for monitoring students’ writing progress. Yet, it is evi- 
dent that teachers, schools, and districts are already using such 
measures (Fewster & MacMillan, 2002; Gansle et al., 2002; 
Malecki & Jewell, 2003) and that WW or WSC is often used 
as the primary index of writing proficiency. Educators should 
use caution in interpreting results of these measures. There is 
some evidence that simple, countable indices of written ex- 
pression are useful for screening (Parker et al., 1991a, 1991b; 
Watkinson & Lee, 1992), and percentage measures appear to 
be more technically sound for this purpose than do fluency mea- 
sures. There is also evidence that, at least at the higher grades, 
more complex scoring procedures, such as CIWS, are more 
technically sound (e.g., Espin et al., 2000; Jewell & Malecki, 
2005). Further, for instructional decision making, educators 
might wish to consider qualitative, as well as quantitative, as- 
pects of students’ writing (Tindal & Hasbrouck, 1991). Fi- 
nally, educators should keep an eye toward the research for 
further development of progress-monitoring approaches in 
written expression. It is our hope that upcoming research will 
lead to great improvements in the technical soundness and in- 
structional utility of CBM in written expression, eventually 
leading to a seamless and flexible system for monitoring stu- 
dent progress in writing. 